Abstract

We draw relationships between the generalized data processing theorems of Zakai and Ziv (1973 and 1975) and the dynamical version of the second law of thermodynamics, a.k.a. the Boltzmann H-Theorem, which asserts that the Shannon entropy, $H(X_t)$, pertaining to a finite-state Markov process $\{X_t\}$, is monotonically non-decreasing as a function of time $t$, provided that the steady-state distribution of this process is uniform across the state space (which is the case when the process designates an isolated system). It turns out that both the generalized data processing theorems and the Boltzmann H-Theorem can be viewed as special cases of a more general principle concerning the monotonicity (in time) of a certain generalized information measure applied to a Markov process. This gives rise to a new look at the generalized data processing theorem, which suggests exploiting certain degrees of freedom that may lead to better bounds, for a given choice of the convex function that defines the generalized mutual information.

Index Terms: Data processing inequality, convexity, perspective function, H-Theorem, thermodynamics, detailed balance.

1 Introduction

In [6], Csiszár considered a generalized notion of the divergence between two probability distributions, a.k.a. the f-divergence, by replacing the negative logarithm function of the classical divergence,

$$D(P_1\|P_2) = \int dx\, P_1(x)\left[-\log\frac{P_2(x)}{P_1(x)}\right], \tag{1}$$

with a general convex function $Q$,¹ i.e.,

$$D_Q(P_1\|P_2) = \int dx\, P_1(x)\, Q\!\left(\frac{P_2(x)}{P_1(x)}\right). \tag{2}$$

When the f-divergence was applied to the joint distribution (in the role of $P_1$) and the product of marginals (in the role of $P_2$) of two random variables, it yielded a generalized notion of mutual information,

$$I^Q(X;Y) = \int dx\,dy\, P(x,y)\, Q\!\left(\frac{P(x)P(y)}{P(x,y)}\right) = \int dx\,dy\, P(x,y)\, Q\!\left(\frac{P(y)}{P(y|x)}\right), \tag{3}$$

which was shown in [6] to obey a data processing inequality, thus extending the well known data 
processing inequality of the ordinary mutual information (see, e.g., Section 2.8]). 

The same ideas were introduced independently by Ziv and Zakai [14], with the primary motivation of using them to obtain sharper distortion bounds for classes of simple codes for joint source-channel coding (e.g., of block length 1), as well as for certain situations of signal detection and estimation (see also [1]). The idea was to define both a "rate-distortion function," $R^Q(d)$, and a "channel capacity," $C^Q$, by minimization and maximization (respectively) of the mutual information pertaining to $Q$, and to derive a lower bound on the distortion $d$ from the data processing inequality

$$R^Q(d) \le C^Q. \tag{4}$$

In the sequel, this will be referred to as the 1973 version of the generalized data processing theorem. In a somewhat less well known work [15], Zakai and Ziv substantially further generalized their data processing theorems, so as to apply to even more general information measures, and this will be referred to as the 1975 version.

¹ Originally, this function was denoted by f in [6], hence the name f-divergence.

This generalized information measure was in the form

$$\int dx\,dy\, P(x,y)\, Q\!\left(\frac{\mu_1(y|x)}{P(y|x)},\ldots,\frac{\mu_k(y|x)}{P(y|x)}\right), \tag{5}$$

where $Q$ is now an arbitrary convex function of $k$ variables and $\{\mu_i(x,y)\}$ are arbitrary positive measures (not necessarily probability measures) that are defined consistently with the Markov conditions, and where $\mu_i(y|x) = \mu_i(x,y)/P(x)$. It was shown in [15, Theorem 7.1] that the distortion bounds obtained from (5) are tight in the sense that there always exist a convex function $Q$ and measures $\{\mu_i\}$ that would yield the exact distortion pertaining to the optimum communication system, and so, there is no room for improvement of this class of bounds.²

By setting $\mu_i(y|x) = P(y|x_i)$, $i = 1, 2, \ldots, k-1$, where $\{x_i\}$ are $k-1$ particular letters in the alphabet of $X$, and $\mu_k(y|x) = P(y)$, they defined yet another generalized information measure that satisfies the data processing theorem as

$$E\left\{Q\!\left(\frac{P(Y|X_1)}{P(Y|X)},\ldots,\frac{P(Y|X_{k-1})}{P(Y|X)},\frac{P(Y)}{P(Y|X)}\right)\right\}, \tag{6}$$

where the expectation is taken w.r.t. the joint distribution

$$P(x_1,\ldots,x_{k-1},x,y) = P(x)P(y|x)P(x_1)P(x_2)\cdots P(x_{k-1}).$$

In both [13] and [15], there are many examples of how these data processing inequalities can be used to improve on earlier distortion bounds.

The data processing theorems of Csiszár and Zakai and Ziv form one aspect of this work. The other aspect, which may seem unrelated at first glance (but will nevertheless be shown here to be strongly related), is the second law of thermodynamics, or more precisely, Boltzmann's H-theorem. The second law of thermodynamics states that in an isolated physical system (i.e., when no energy flows in or out), the entropy cannot decrease over time. Since one of the basic postulates of statistical physics states that all states of the system that have the same energy also have the same probability in equilibrium, it follows that the stationary (equilibrium) distribution of these states must be uniform, because all accessible states must have the same energy when the system is isolated. Indeed, if the state of this system is designated by a Markov process, $\{X_t\}$, with a uniform stationary state distribution, the Boltzmann H-theorem states that the Shannon entropy of $X_t$, $H(X_t)$, cannot decrease with $t$, which is a restatement of the second law.

² This result is non-constructive, however, in the sense that this choice of $Q$ and $\{\mu_i\}$ depends on the optimum encoder and decoder.

We show, in this paper, that the generalized data processing theorems of [6], [13], and [15] on the one hand, and the Boltzmann H-theorem, on the other hand, are all special cases of a more general principle, which asserts that a certain generalized information measure, applied to the underlying Markov process, must be a monotonic function of time. This unified framework provides a new perspective on the generalized data processing theorem. Beyond the fact that this new perspective may be interesting in its own right, it naturally suggests exploiting certain degrees of freedom of the Ziv-Zakai generalized mutual information that may lead to better bounds, for a given choice of the convex function that defines this generalized mutual information. These additional degrees of freedom may be important, because the variety of convex functions $\{Q\}$ that are convenient to work with is rather limited. The fact that better bounds may indeed be obtained is demonstrated by an example.

The outline of the remaining part of this paper is as follows. In Section 2, we provide some background on Markov processes with a slight physical flavor, which will include the notions of detailed balance and global balance, as well as known results like the Boltzmann H-theorem and its generalizations to information measures other than the entropy. In Section 3, we relate the generalized version of the Boltzmann H-theorem and the generalized data processing theorems and formalize the unified framework that supports both. This is done first for the 1973 version [14] of the Ziv-Zakai data processing theorem (along with an example), and then for the 1975 version
by Zakai and Ziv [15]. Finally, in Section [3] we summarize and conclude. 

2 Background 

2.1 Detailed Balance and Global Balance 

Many dynamical models of a physical system describe the microscopic state (or microstate, for 
short) of this system as a Markov process, {Xt}, either in discrete time or in continuous time. In 
this section, we discuss a few properties of these processes as well as the evolution of information 
measures associated with them, like entropy, divergence and more. 



We begin with an isolated system in continuous time, which is not necessarily assumed to have yet reached its stationary distribution pertaining to equilibrium. Let us suppose that the state $X_t$ may take on values in a finite set $\mathcal{X}$. For $x, x' \in \mathcal{X}$, let us define the state transition rates

$$W_{xx'} = \lim_{\delta\to 0}\frac{\Pr\{X_{t+\delta} = x'|X_t = x\}}{\delta}, \quad x' \ne x, \tag{7}$$

which means, in other words,

$$\Pr\{X_{t+\delta} = x'|X_t = x\} = W_{xx'}\cdot\delta + o(\delta). \tag{8}$$

Denoting

$$P_t(x) = \Pr\{X_t = x\}, \tag{9}$$

it is easy to see that

$$P_{t+dt}(x) = \sum_{x'\ne x} P_t(x')W_{x'x}\,dt + P_t(x)\left[1 - \sum_{x'\ne x} W_{xx'}\,dt\right], \tag{10}$$

where the first sum describes the probabilities of all possible transitions from other states to state $x$ and the second term describes the probability of not leaving state $x$. Subtracting $P_t(x)$ from both sides and dividing by $dt$, we immediately obtain the following set of differential equations:

$$\frac{dP_t(x)}{dt} = \sum_{x'}\left[P_t(x')W_{x'x} - P_t(x)W_{xx'}\right], \quad x \in \mathcal{X}, \tag{11}$$

where $W_{xx}$ is defined in an arbitrary manner, e.g., $W_{xx} = 0$ for all $x \in \mathcal{X}$. In the physics terminology (see, e.g., [10], [12]), these equations are called the master equations.³ When the process reaches stationarity, i.e., for all $x \in \mathcal{X}$, $P_t(x)$ converges to some $P(x)$ that is time-invariant, then

$$\sum_{x'}\left[P(x')W_{x'x} - P(x)W_{xx'}\right] = 0, \quad \forall x \in \mathcal{X}. \tag{12}$$

This situation is called global balance or steady state. When the physical system under discussion is isolated, namely, no energy flows into the system or out, the steady state distribution must be uniform across all states, because all accessible states must be of the same energy and the equilibrium probability of each state depends solely on its energy. Thus, in the case of an isolated system, $P(x) = 1/|\mathcal{X}|$ for all $x \in \mathcal{X}$. From quantum mechanical considerations, as well as considerations pertaining to time reversibility at the microscopic level,⁴ it is customary to assume $W_{xx'} = W_{x'x}$ for all pairs $\{x, x'\}$. We then observe that, not only do $\sum_{x'}[P(x')W_{x'x} - P(x)W_{xx'}]$ all vanish, but moreover, each individual term in this sum vanishes, as

$$P(x')W_{x'x} - P(x)W_{xx'} = \frac{1}{|\mathcal{X}|}\left(W_{x'x} - W_{xx'}\right) = 0. \tag{13}$$

³ Note that the master equations apply in discrete time too, provided that the derivative at the l.h.s. is replaced by a simple difference, $P_{t+1}(x) - P_t(x)$, and $\{W_{xx'}\}$ are replaced by one-step state transition probabilities.

This property is called detailed balance, which is stronger than global balance, and it means equilibrium, which is stronger than steady state. While both steady state and equilibrium refer to situations of time-invariant state probabilities $\{P(x)\}$, a steady state still allows cyclic "flows of probability." For example, a Markov process with cyclic deterministic transitions $1 \to 2 \to 3 \to 1 \to 2 \to 3 \to \cdots$ is in steady state provided that the probability distribution of the initial state is uniform $(1/3, 1/3, 1/3)$; however, the cyclic flow among the states is in one direction. On the other hand, in detailed balance ($W_{xx'} = W_{x'x}$ for an isolated system), which is equilibrium, there is no net flow in any cycle of states. All the net cyclic probability fluxes vanish, and therefore, time reversal would not change the probability law, that is, $\{X_{-t}\}$ has the same probability law as $\{X_t\}$ (see [3, Sect. 1.2]). For example, if $\{Y_t\}$ is a Bernoulli process, taking values equiprobably in $\{-1, +1\}$, then $X_t$ defined recursively by

$$X_{t+1} = (X_t + Y_t) \bmod K \tag{14}$$

has a symmetric state-transition probability matrix $W$, a uniform stationary state distribution, and it satisfies detailed balance.
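
The distinction between global and detailed balance can be checked numerically on the two examples above. The following is a minimal sketch in Python using exact rational arithmetic; the helper names (`ring_walk_matrix`, `is_stationary`, `satisfies_detailed_balance`) and the choice K = 5 are illustrative assumptions, not part of the original text.

```python
from fractions import Fraction

def ring_walk_matrix(K):
    """W[x][y] = Pr{X_{t+1} = y | X_t = x} for X_{t+1} = (X_t + Y_t) mod K,
    with Y_t equiprobable on {-1, +1} (hypothetical helper)."""
    W = [[Fraction(0)] * K for _ in range(K)]
    for x in range(K):
        W[x][(x + 1) % K] += Fraction(1, 2)
        W[x][(x - 1) % K] += Fraction(1, 2)
    return W

def is_symmetric(W):
    n = len(W)
    return all(W[x][y] == W[y][x] for x in range(n) for y in range(n))

def is_stationary(P, W):
    """Global balance in discrete time: sum_x P(x) W[x][y] = P(y) for all y."""
    n = len(P)
    return all(sum(P[x] * W[x][y] for x in range(n)) == P[y] for y in range(n))

def satisfies_detailed_balance(P, W):
    """Detailed balance: P(x) W[x][y] = P(y) W[y][x] for every pair (x, y)."""
    n = len(P)
    return all(P[x] * W[x][y] == P[y] * W[y][x]
               for x in range(n) for y in range(n))

K = 5  # assumed alphabet size, for illustration only
uniform = [Fraction(1, K)] * K
W = ring_walk_matrix(K)
print(is_symmetric(W), is_stationary(uniform, W),
      satisfies_detailed_balance(uniform, W))

# Deterministic cycle 1 -> 2 -> 3 -> 1: steady state without detailed balance.
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
third = [Fraction(1, 3)] * 3
print(is_stationary(third, cycle), satisfies_detailed_balance(third, cycle))
```

The random walk on the ring passes all three checks, whereas the deterministic cycle is stationary under the uniform distribution but fails the pairwise test, which is exactly the one-directional probability flow described above.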

2.2 Monotonicity of Information Measures 

Returning to the case where the process $\{X_t\}$ pertaining to our isolated system has not necessarily reached equilibrium, let us take a look at the entropy of the state

$$H(X_t) = -\sum_{x} P_t(x)\log P_t(x). \tag{15}$$

The Boltzmann H-theorem (see, e.g., [3, Chap. 7], [SJ Sect. 3.5], [10, pp. 171-173], [12, pp. 624-626]) asserts that $H(X_t)$ is monotonically non-decreasing. This result is a restatement of the second law of thermodynamics, which states that the entropy of an isolated system cannot decrease with time.

⁴ Consider, for example, an isolated system of moving particles of mass $m$ and position vectors $\{r_i(t)\}$, obeying the differential equations $m\,d^2r_i(t)/dt^2 = \sum_{j\ne i} F(r_j(t) - r_i(t))$, $i = 1, 2, \ldots, n$ ($F(r_j(t) - r_i(t))$ being mutual interaction forces), which remain valid if the time variable $t$ is replaced by $-t$, since $d^2r_i(t)/dt^2 = d^2r_i(-t)/d(-t)^2$.

To see why this is true, we next show that detailed balance implies

$$\frac{dH(X_t)}{dt} \ge 0, \tag{16}$$

where, for convenience, we denote $dP_t(x)/dt$ by $\dot{P}_t(x)$. Now,

$$\begin{aligned}
\frac{dH(X_t)}{dt} &= -\sum_{x}\left[\dot{P}_t(x)\log P_t(x) + \dot{P}_t(x)\right] \\
&= -\sum_{x}\dot{P}_t(x)\log P_t(x) \\
&= -\sum_{x}\sum_{x'} W_{x'x}\left[P_t(x') - P_t(x)\right]\log P_t(x) \\
&= -\frac{1}{2}\sum_{x,x'} W_{x'x}\left[P_t(x') - P_t(x)\right]\log P_t(x) - \frac{1}{2}\sum_{x,x'} W_{x'x}\left[P_t(x) - P_t(x')\right]\log P_t(x') \\
&= \frac{1}{2}\sum_{x,x'} W_{x'x}\left[P_t(x') - P_t(x)\right]\cdot\left[\log P_t(x') - \log P_t(x)\right] \\
&\ge 0,
\end{aligned} \tag{17}$$

where in the second line we used the fact that $\sum_x \dot{P}_t(x) = 0$, in the third line we used detailed balance ($W_{xx'} = W_{x'x}$), and the last inequality is due to the increasing monotonicity of the logarithmic function: the product $[P_t(x') - P_t(x)]\cdot[\log P_t(x') - \log P_t(x)]$ cannot be negative for any pair $(x, x')$, as the two factors of this product are either both negative, both zero, or both positive. Thus, $H(X_t)$ cannot decrease with time.

The H-theorem has a discrete-time analogue: If a finite-state Markov process has a symmetric transition probability matrix (which is the discrete-time counterpart of the above detailed balance property), which means that the stationary state distribution is uniform, then $H(X_t)$ is a monotonically non-decreasing sequence.
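
The discrete-time statement is easy to observe numerically. The sketch below, in Python, propagates an initial distribution through a symmetric transition matrix and records the entropy at each step; the matrix `W_sym`, the initial distribution, and the function names are illustrative assumptions chosen only for this demonstration.

```python
import math

def step(P, W):
    """One step of the chain: P_{t+1}(y) = sum_x P_t(x) W[x][y]."""
    n = len(P)
    return [sum(P[x] * W[x][y] for x in range(n)) for y in range(n)]

def entropy(P):
    """Shannon entropy in nats, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in P if p > 0)

def entropy_trajectory(P0, W, T):
    """Return [H(X_0), H(X_1), ..., H(X_T)]."""
    P, out = list(P0), []
    for _ in range(T + 1):
        out.append(entropy(P))
        P = step(P, W)
    return out

# A symmetric (hence doubly stochastic) one-step matrix; values are assumed.
W_sym = [[0.5, 0.3, 0.2],
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]
H = entropy_trajectory([0.90, 0.08, 0.02], W_sym, 20)
print(H[0], H[1], H[-1], math.log(3))
```

In this run the entropy climbs from its initial value toward log 3, the entropy of the uniform distribution over three states, without ever decreasing (up to floating-point rounding).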

A well-known paradox, in this context, is associated with the notion of the arrow of time. On the one hand, we are talking about time-reversible processes, obeying detailed balance, but on the other hand, the increase of entropy suggests that there is asymmetry between the two possible directions in which the time axis can be traversed, the forward direction and the backward direction. If we go back in time, the entropy would decrease. So is there an arrow of time? This paradox was resolved, by Boltzmann himself, once he made the clear distinction between equilibrium and non-equilibrium situations: The notion of time reversibility is associated with equilibrium, where the process $\{X_t\}$ is stationary. On the other hand, the increase of entropy is a result that belongs to the non-stationary regime, where the process is on its way to stationarity and equilibrium. In the latter case, the system has been initially prepared in a non-equilibrium situation. Of course, when the process is stationary, $H(X_t)$ is fixed and there is no contradiction.

So far we discussed the property of detailed balance only for an isolated system, where the 
stationary state distribution is the uniform distribution. How is the property of detailed balance 
defined when the stationary distribution is non-uniform? For a general Markov process, whose 
steady state-distribution is not necessarily uniform, the condition of detailed balance, which means 
time-reversibility [S], reads 

$$P(x)W_{xx'} = P(x')W_{x'x}, \tag{18}$$

in the continuous-time case. In the discrete-time case (where $t$ takes on positive integer values only), it is defined by a similar equation, except that $W_{xx'}$ and $W_{x'x}$ are replaced by the corresponding one-step state transition probabilities, i.e.,

$$P(x)P(x'|x) = P(x')P(x|x'), \tag{19}$$

where

$$P(x'|x) = \Pr\{X_{t+1} = x'|X_t = x\}. \tag{20}$$

The physical interpretation is that now our system is (a small) part of a much larger isolated system, which obeys detailed balance w.r.t. the uniform equilibrium distribution, as before. A well known example of a process that obeys detailed balance in its more general form is the M/M/1 queue with an arrival rate $\lambda$ and service rate $\mu$ ($\lambda < \mu$). Here, since all states are arranged along a line, with bidirectional transitions between neighboring states only (see Fig. 1), there cannot be any cyclic probability flux. The steady-state distribution is well known to be geometric,

$$P(x) = \left(1 - \frac{\lambda}{\mu}\right)\cdot\left(\frac{\lambda}{\mu}\right)^x, \quad x = 0, 1, 2, \ldots, \tag{21}$$

which indeed satisfies the detailed balance $P(x)\lambda = P(x+1)\mu$ for all $x$. Thus, the Markov process $\{X_t\}$, designating the number of customers in the queue at time $t$, is time-reversible.
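
As a quick sanity check of the M/M/1 example, the following Python sketch verifies the detailed balance relation between neighboring states for the geometric distribution of (21), using exact rational arithmetic. The rates (2 and 5) and the function name `mm1_stationary` are assumed purely for illustration.

```python
from fractions import Fraction

def mm1_stationary(lam, mu, x):
    """Geometric steady-state probability of x customers, valid for lam < mu."""
    rho = Fraction(lam) / Fraction(mu)
    return (1 - rho) * rho ** x

lam, mu = Fraction(2), Fraction(5)  # assumed arrival and service rates

# Detailed balance between neighbors: P(x) * lam == P(x + 1) * mu.
balanced = all(
    mm1_stationary(lam, mu, x) * lam == mm1_stationary(lam, mu, x + 1) * mu
    for x in range(50)
)

# The first N terms of the geometric series sum to 1 - (lam/mu)^N.
partial_sum = sum(mm1_stationary(lam, mu, x) for x in range(50))
print(balanced, float(partial_sum))
```

Because the state diagram is a line, checking neighboring pairs is sufficient: all other pairs have zero transition rates in both directions.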

Figure 1: State transition diagram of an M/M/1 queue.

For the sake of simplicity, from this point onward, our discussion will focus almost exclusively on discrete-time Markov processes, but the results to be stated will hold for continuous-time Markov processes as well. We will continue to denote by $P_t(x)$ the probability of $X_t = x$, except that now $t$ will be limited to take on integer values only. The one-step state transition probabilities will be denoted by $\{P(x'|x)\}$, as mentioned earlier.

How does the H-theorem extend to situations where the stationary state distribution is not uniform? In [5, p. 82], it is shown (among other things) that the divergence,

$$D(P_t\|P) = \sum_{x\in\mathcal{X}} P_t(x)\log\frac{P_t(x)}{P(x)}, \tag{22}$$

where $P = \{P(x),\ x \in \mathcal{X}\}$ is a stationary state distribution, is a monotonically non-increasing function of $t$. Does this result have a physical interpretation, like the H-theorem and the second law of thermodynamics? When it comes to non-isolated systems, where the steady state distribution is non-uniform, the extension of the second law of thermodynamics replaces the principle of increase of entropy by the principle of decrease of free energy, or equivalently, the decrease of the difference between the free energy at time $t$ and the free energy in equilibrium. The information-theoretic counterpart of this free energy difference is the divergence $D(P_t\|P)$ (see, e.g., [2]). Thus, the monotonic decrease of $D(P_t\|P)$ has a simple physical interpretation of free energy decrease, which is the natural extension of the entropy increase. Indeed, particularizing this to the case where $P$ is the uniform distribution (as in an isolated system), we have

$$D(P_t\|P) = \log|\mathcal{X}| - H(X_t), \tag{23}$$

which means that the decrease of the divergence is equivalent to the increase of entropy, as before.
However, here the result is more general than the H-theorem from an additional aspect: It does not require detailed balance. It only requires the existence of the stationary state distribution. Note that even in the earlier case, of an isolated system, detailed balance, which means symmetry of the state transition probability matrix ($P(x'|x) = P(x|x')$), is a stronger requirement than uniformity of the stationary state distribution, as the latter requires merely that the matrix $\{P(x'|x)\}$ be doubly stochastic, i.e., $\sum_x P(x|x') = \sum_x P(x'|x) = 1$ for all $x' \in \mathcal{X}$, which is weaker than symmetry of the matrix itself. The results shown in [5] are, in fact, somewhat more general: Let $P_t = \{P_t(x)\}$ and $P'_t = \{P'_t(x)\}$ be two time-varying state distributions pertaining to the same Markov chain, but induced by two different initial state distributions, $\{P_0(x)\}$ and $\{P'_0(x)\}$, respectively. Then $D(P_t\|P'_t)$ is monotonically non-increasing. This is easily seen as follows:

$$\begin{aligned}
D(P_t\|P'_t) &= \sum_{x} P_t(x)\log\frac{P_t(x)}{P'_t(x)} \\
&= \sum_{x,x'} P_t(x)P(x'|x)\log\frac{P_t(x)P(x'|x)}{P'_t(x)P(x'|x)} \\
&= \sum_{x,x'} P(X_t = x, X_{t+1} = x')\log\frac{P(X_t = x, X_{t+1} = x')}{P'(X_t = x, X_{t+1} = x')} \\
&\ge D(P_{t+1}\|P'_{t+1}),
\end{aligned} \tag{24}$$

where the last inequality follows from the data processing theorem of the divergence: the divergence between two joint distributions of $(X_t, X_{t+1})$ is never smaller than the divergence between the corresponding marginal distributions of $X_{t+1}$. Another interesting special case of this result is obtained if we now take the first argument of the divergence to be a stationary state distribution: This will mean that $D(P\|P_t)$ is also monotonically non-increasing.

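In [9, Theorem 1.6], there is a further extension of all the above monotonicity results, where the ordinary divergence is actually replaced by the f-divergence (though the relation to the f-divergence is not mentioned in [9]): If $\{X_t\}$ is a Markov process with a given state transition probability matrix $\{P(x'|x)\}$, then the function

$$U(t) = D_Q(P\|P_t) = \sum_{x\in\mathcal{X}} P(x)\cdot Q\!\left(\frac{P_t(x)}{P(x)}\right) \tag{25}$$

is monotonically non-increasing, provided that $Q$ is convex. Moreover, $U(t)$ is strictly decreasing if $Q$ is strictly convex and $\{P_t(x)\}$ is not identical to $\{P(x)\}$. To see why this is true, define the backward transition probability matrix by

$$\tilde{P}(x|x') = \frac{P(x)P(x'|x)}{P(x')}. \tag{26}$$

Obviously,

$$\sum_{x}\tilde{P}(x|x') = 1 \tag{27}$$

for all $x' \in \mathcal{X}$, and so,

$$\frac{P_{t+1}(x)}{P(x)} = \sum_{x'}\frac{P_t(x')P(x|x')}{P(x)} = \sum_{x'}\frac{\tilde{P}(x'|x)P_t(x')}{P(x')}. \tag{28}$$

By the convexity of $Q$:

$$\begin{aligned}
U(t+1) &= \sum_{x} P(x)\cdot Q\!\left(\frac{P_{t+1}(x)}{P(x)}\right) \\
&= \sum_{x} P(x)\cdot Q\!\left(\sum_{x'}\tilde{P}(x'|x)\frac{P_t(x')}{P(x')}\right) \\
&\le \sum_{x}\sum_{x'} P(x)\tilde{P}(x'|x)\cdot Q\!\left(\frac{P_t(x')}{P(x')}\right) \\
&= \sum_{x}\sum_{x'} P(x')P(x|x')\cdot Q\!\left(\frac{P_t(x')}{P(x')}\right) \\
&= \sum_{x'} P(x')\cdot Q\!\left(\frac{P_t(x')}{P(x')}\right) = U(t).
\end{aligned} \tag{29}$$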
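
The monotonicity of U(t) can be observed numerically for several convex choices of Q at once. The Python sketch below is only an illustration under stated assumptions: the three-state matrix `W`, its stationary distribution `P_stat`, the initial distribution, and all function names are hypothetical. The chain is deliberately non-reversible and has a non-uniform stationary distribution, since neither detailed balance nor uniformity is needed for this result.

```python
import math

# Assumed non-reversible chain; rows are the current state, columns the next.
W = [[0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
P_stat = [0.5, 0.25, 0.25]  # satisfies sum_x P_stat[x] * W[x][y] = P_stat[y]

def step(P, W):
    """P_{t+1}(y) = sum_x P_t(x) W[x][y]."""
    n = len(P)
    return [sum(P[x] * W[x][y] for x in range(n)) for y in range(n)]

def U(P_t, P_stat, Q):
    """U(t) = sum_x P(x) Q(P_t(x) / P(x))."""
    return sum(p * Q(pt / p) for p, pt in zip(P_stat, P_t))

# A few convex choices of Q (the labels are just dictionary keys).
CHOICES = {
    "u ln u": lambda u: u * math.log(u) if u > 0 else 0.0,
    "-ln u": lambda u: -math.log(u),
    "-sqrt(u)": lambda u: -math.sqrt(u),
    "u^2/2": lambda u: 0.5 * u * u,
}

def trajectories(P0, T):
    """Return {name: [U(0), ..., U(T)]} for every Q in CHOICES."""
    out = {name: [] for name in CHOICES}
    P = list(P0)
    for _ in range(T + 1):
        for name, Q in CHOICES.items():
            out[name].append(U(P, P_stat, Q))
        P = step(P, W)
    return out

traj = trajectories([0.1, 0.2, 0.7], 25)
for name, seq in traj.items():
    print(name, seq[0], seq[-1])
```

For each choice of Q, the recorded sequence should be non-increasing up to rounding error, which is the content of the inequality chain ending in U(t).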

Now, a few interesting choices of the function $Q$ may be considered: As proposed in [9, p. 19], for $Q(u) = u\ln u$, we have $U(t) = D(P_t\|P)$, and we are back to the aforementioned result in [5]. Another interesting choice is $Q(u) = -\ln u$, which gives $U(t) = D(P\|P_t)$. Thus, the monotonicity of $D(P\|P_t)$ is also obtained as a special case.⁵ Yet another choice is $Q(u) = -u^s$, where $s \in [0, 1]$ is a parameter. This would yield the increasing monotonicity of $\sum_x P^{1-s}(x)P_t^s(x)$, a 'metric' that plays a role in the theory of asymptotic exponents of error probabilities pertaining to the optimum likelihood ratio test between two probability distributions [13, Chapter 3]. In particular, the choice $s = 1/2$ yields balance between the two kinds of error and it is intimately related to the Bhattacharyya distance. In the case of detailed balance, there is another physical interpretation of the approach to equilibrium and the growth of $U(t)$ [9, p. 20]: Returning, for a moment, to the realm of continuous-time Markov processes, we can write the master equations as follows:

$$\frac{dP_t(x)}{dt} = \sum_{x'}\frac{1}{R_{xx'}}\left[\frac{P_t(x')}{P(x')} - \frac{P_t(x)}{P(x)}\right], \tag{30}$$

where $R_{xx'} = [P(x')W_{x'x}]^{-1} = [P(x)W_{xx'}]^{-1}$. Imagine now an electrical circuit where the indices $\{x\}$ designate the various nodes. Nodes $x$ and $x'$ are connected by a wire with resistance $R_{xx'}$ and every node $x$ is grounded via a capacitor with capacitance $P(x)$ (see Fig. 2). If $P_t(x)$ is the charge at node $x$ at time $t$, then the master equations are the Kirchhoff equations of the currents at each node in the circuit. Thus, the way in which probability spreads across the states is analogous to the way charge spreads across the circuit, and probability fluxes are now analogous to electrical currents. If we now choose $Q(u) = \frac{1}{2}u^2$, then

$$U(t) = \frac{1}{2}\sum_{x}\frac{P_t^2(x)}{P(x)}, \tag{31}$$

⁵ We are not yet in a position to obtain the monotonicity of $D(P_t\|P'_t)$ as a special case of the monotonicity of $D_Q(P\|P_t)$. This will require a slight further extension of this information measure, to be carried out later on.



which means that the energy stored in the capacitors dissipates as heat in the wires until the system reaches equilibrium, where all nodes have the same potential, $P_t(x)/P(x) = 1$, and hence detailed balance corresponds to the situation where all individual currents vanish (not only their algebraic sum).

Figure 2: State transition diagram of a Markov chain (left part) and the electric circuit that emulates the dynamics of $\{P_t(x)\}$ (right part).



We have seen, in the above examples, that various choices of the function $Q$ yield various f-divergences, or 'metrics', between $\{P(x)\}$ and $\{P_t(x)\}$, which are both marginal distributions of a single symbol $x$. What about joint distributions of two or more symbols? Consider, for example, the function

$$J(t) = \sum_{x,x'} P(X_0 = x, X_t = x')\cdot Q\!\left(\frac{P(X_0 = x)P(X_t = x')}{P(X_0 = x, X_t = x')}\right), \tag{32}$$

where $Q$ is convex as before. Here, by the same token, $J(t)$ is the f-divergence between the joint probability distribution $\{P(X_0 = x, X_t = x')\}$ and the product of marginals $\{P(X_0 = x)P(X_t = x')\}$, namely, it is the generalized mutual information of [6], [13], and [15], as mentioned in the Introduction. Now, using a similar chain of inequalities as before, we get the non-increasing monotonicity of $J(t)$ (the Jensen step for a convex $Q$ gives $J(t) \ge J(t+1)$) as follows:

$$\begin{aligned}
J(t) &= \sum_{x,x',x''} P(X_0 = x, X_t = x', X_{t+1} = x'')\times Q\!\left(\frac{P(X_0 = x)P(X_t = x')}{P(X_0 = x, X_t = x')}\cdot\frac{P(X_{t+1} = x''|X_t = x')}{P(X_{t+1} = x''|X_t = x')}\right) \\
&= \sum_{x,x''} P(X_0 = x, X_{t+1} = x'')\sum_{x'} P(X_t = x'|X_0 = x, X_{t+1} = x'')\times Q\!\left(\frac{P(X_0 = x)P(X_t = x', X_{t+1} = x'')}{P(X_0 = x, X_t = x', X_{t+1} = x'')}\right) \\
&\ge \sum_{x,x''} P(X_0 = x, X_{t+1} = x'')\, Q\!\left(\sum_{x'} P(X_t = x'|X_0 = x, X_{t+1} = x'')\times\frac{P(X_0 = x)P(X_t = x', X_{t+1} = x'')}{P(X_0 = x, X_t = x', X_{t+1} = x'')}\right) \\
&= \sum_{x,x''} P(X_0 = x, X_{t+1} = x'')\, Q\!\left(\sum_{x'}\frac{P(X_0 = x)P(X_t = x', X_{t+1} = x'')}{P(X_0 = x, X_{t+1} = x'')}\right) \\
&= \sum_{x,x''} P(X_0 = x, X_{t+1} = x'')\, Q\!\left(\frac{P(X_0 = x)P(X_{t+1} = x'')}{P(X_0 = x, X_{t+1} = x'')}\right) \\
&= J(t+1).
\end{aligned} \tag{33}$$

This time, we assumed only the Markov property of $(X_0, X_t, X_{t+1})$ (not even homogeneity). This is, in fact, nothing but the 1973 version of the generalized data processing theorem of Ziv and Zakai [14], which was mentioned in the Introduction.
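
The data processing behavior of J(t) along a Markov chain can also be illustrated numerically. In the Python sketch below, the joint distribution of the initial state and the state at time t is built from the t-step transition matrix, and J(t) is evaluated for two convex choices of Q (the choice of the negative logarithm recovers the ordinary mutual information). The transition matrix, the initial distribution, and the function names are assumptions made for this illustration; a homogeneous chain with strictly positive entries is used only to keep every joint probability positive.

```python
import math

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def generalized_mi(P0, Wt, Q):
    """J = sum_{x,y} P(x,y) Q(P0(x) Pt(y) / P(x,y)), with P(x,y) = P0(x) Wt[x][y].

    Wt is the t-step transition matrix; all joint probabilities are assumed
    strictly positive so that the ratio is well defined."""
    n = len(P0)
    Pt = [sum(P0[x] * Wt[x][y] for x in range(n)) for y in range(n)]
    total = 0.0
    for x in range(n):
        for y in range(n):
            joint = P0[x] * Wt[x][y]
            total += joint * Q(P0[x] * Pt[y] / joint)
    return total

def J_trajectory(P0, W, Q, T):
    """Return [J(1), J(2), ..., J(T)]."""
    Wt, out = W, []
    for _ in range(T):
        out.append(generalized_mi(P0, Wt, Q))
        Wt = matmul(Wt, W)
    return out

W_pos = [[0.70, 0.20, 0.10],
         [0.10, 0.60, 0.30],
         [0.50, 0.25, 0.25]]
P0 = [0.2, 0.5, 0.3]

J_log = J_trajectory(P0, W_pos, lambda u: -math.log(u), 10)    # ordinary MI
J_sqrt = J_trajectory(P0, W_pos, lambda u: -math.sqrt(u), 10)  # Q(z) = -sqrt(z)
print(J_log[0], J_log[-1])
print(J_sqrt[0], J_sqrt[-1])
```

Both sequences should be non-increasing in t: the dependence between the initial state and the current state can only weaken as the chain evolves.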



3 A Unified Framework 

In spite of the general resemblance (via the notion of the f-divergence), the last monotonicity result, concerning $J(t)$, and the monotonicity of $D(P_t\|P'_t)$, do not seem, at first glance, to fall in the framework of the monotonicity of the f-divergence $D_Q(P\|P_t)$. This is because in the latter, there is an additional dependence on a stationary state distribution that appears neither in $D(P_t\|P'_t)$ nor in $J(t)$. However, two simple observations can put them both in the framework of the monotonicity of $D_Q(P\|P_t)$.

The first observation is that the monotonicity of $U(t) = D_Q(P\|P_t)$ continues to hold (with a straightforward extension of the proof) if $P_t(x)$ is extended to be a vector of time-varying state distributions $(P_t^1(x), P_t^2(x), \ldots, P_t^k(x))$, and $Q$ is taken to be a convex function of $k$ variables. Moreover, each component $P_t^i(x)$ does not necessarily have to be a probability distribution. It can be any function $\mu_t^i(x)$ that satisfies the recursion

$$\mu_{t+1}^i(x) = \sum_{x'}\mu_t^i(x')P(x|x'), \quad 1 \le i \le k. \tag{34}$$

Let us then denote $\mu_t(x) = (\mu_t^1(x), \mu_t^2(x), \ldots, \mu_t^k(x))$ and assume that $Q$ is jointly convex in all its $k$ arguments. Then the redefined function

$$U(t) = \sum_{x\in\mathcal{X}} P(x)\cdot Q\!\left(\frac{\mu_t(x)}{P(x)}\right) = \sum_{x\in\mathcal{X}} P(x)\cdot Q\!\left(\frac{\mu_t^1(x)}{P(x)},\ldots,\frac{\mu_t^k(x)}{P(x)}\right) \tag{35}$$

is monotonically non-increasing with $t$.

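The second observation is rooted in convex analysis, and it is related to the notion of the perspective of a convex function and its convexity property [3]. Here, a few words of background are in order. Let $Q(\boldsymbol{u})$ be a convex function of the vector $\boldsymbol{u} = (u_1, \ldots, u_k)$ and let $v > 0$ be an additional variable. Then, the function

$$\tilde{Q}(v, u_1, u_2, \ldots, u_k) \triangleq v\cdot Q\!\left(\frac{u_1}{v}, \frac{u_2}{v}, \ldots, \frac{u_k}{v}\right) \tag{36}$$

is called the perspective function of $Q$. A well-known property of the perspective operation is conservation of convexity, in other words, if $Q$ is convex in $\boldsymbol{u}$, then $\tilde{Q}$ is convex in $(v, \boldsymbol{u})$. The proof of this fact, which is straightforward, can be found, for example, in [JJ p. 89, Subsection 3.2.6] (see also [7]) and it is brought here for the sake of completeness: Letting $\lambda_1$ and $\lambda_2$ be two non-negative numbers summing to unity and letting $(v_1, \boldsymbol{u}_1)$ and $(v_2, \boldsymbol{u}_2)$ be given, then

$$\begin{aligned}
\tilde{Q}\big(\lambda_1(v_1, \boldsymbol{u}_1) + \lambda_2(v_2, \boldsymbol{u}_2)\big) &= (\lambda_1 v_1 + \lambda_2 v_2)\cdot Q\!\left(\frac{\lambda_1\boldsymbol{u}_1 + \lambda_2\boldsymbol{u}_2}{\lambda_1 v_1 + \lambda_2 v_2}\right) \\
&= (\lambda_1 v_1 + \lambda_2 v_2)\cdot Q\!\left(\frac{\lambda_1 v_1}{\lambda_1 v_1 + \lambda_2 v_2}\cdot\frac{\boldsymbol{u}_1}{v_1} + \frac{\lambda_2 v_2}{\lambda_1 v_1 + \lambda_2 v_2}\cdot\frac{\boldsymbol{u}_2}{v_2}\right) \\
&\le \lambda_1 v_1 Q\!\left(\frac{\boldsymbol{u}_1}{v_1}\right) + \lambda_2 v_2 Q\!\left(\frac{\boldsymbol{u}_2}{v_2}\right) \\
&= \lambda_1\tilde{Q}(v_1, \boldsymbol{u}_1) + \lambda_2\tilde{Q}(v_2, \boldsymbol{u}_2).
\end{aligned} \tag{37}$$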

Putting these two observations together, we can now state the following result:

Theorem 1 Let

$$V(t) = \sum_{x\in\mathcal{X}}\mu_t^0(x)\, Q\!\left(\frac{\mu_t^1(x)}{\mu_t^0(x)}, \frac{\mu_t^2(x)}{\mu_t^0(x)}, \ldots, \frac{\mu_t^k(x)}{\mu_t^0(x)}\right), \tag{38}$$

where $Q$ is a convex function of $k$ variables and $\{\mu_t^i(x)\}_{i=0}^{k}$ are arbitrary functions that satisfy the recursion

$$\mu_{t+1}^i(x) = \sum_{x'}\mu_t^i(x')P(x|x'), \quad i = 0, 1, 2, \ldots, k, \tag{39}$$

and where $\mu_t^0(x)$ is moreover strictly positive. Then, $V(t)$ is a monotonically non-increasing function of $t$.

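Using the above mentioned observations, the proof of Theorem 1 is straightforward: Letting $P$ be a stationary state distribution of $\{X_t\}$, and letting $\tilde{Q}(v, \boldsymbol{u}) = v\cdot Q(\boldsymbol{u}/v)$ denote the perspective of $Q$, we have:

$$\begin{aligned}
V(t) &= \sum_{x}\mu_t^0(x)\, Q\!\left(\frac{\mu_t^1(x)}{\mu_t^0(x)}, \frac{\mu_t^2(x)}{\mu_t^0(x)}, \ldots, \frac{\mu_t^k(x)}{\mu_t^0(x)}\right) \\
&= \sum_{x} P(x)\,\tilde{Q}\!\left(\frac{\mu_t^0(x)}{P(x)}, \frac{\mu_t^1(x)}{P(x)}, \ldots, \frac{\mu_t^k(x)}{P(x)}\right).
\end{aligned} \tag{40}$$

Since $\tilde{Q}$ is the perspective of the convex function $Q$, it is convex as well, and so, the monotonicity of $V(t)$ follows from the first observation above. It is now readily seen that both $D(P_t\|P'_t)$ and $J(t)$ are special cases of $V(t)$ and hence we have covered all special cases seen thus far under the umbrella of the more general information functional $V(t)$.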
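
To make Theorem 1 concrete, the following Python sketch evaluates V(t) for k = 2 with functions that are deliberately not probability distributions, and with a non-smooth convex Q. Everything specific here is an assumption chosen for illustration: the transition matrix `W_pos`, the initial functions, the particular Q, and the helper names. No stationary distribution is computed, reflecting the fact that V(t) itself does not involve one.

```python
def step(mu, W):
    """Recursion of Theorem 1: mu_{t+1}(y) = sum_x mu_t(x) W[x][y]."""
    n = len(mu)
    return [sum(mu[x] * W[x][y] for x in range(n)) for y in range(n)]

def V(mu0, mus, Q):
    """V = sum_x mu0(x) Q(mu1(x)/mu0(x), ..., muk(x)/mu0(x)); needs mu0 > 0."""
    return sum(m0 * Q(*(m[x] / m0 for m in mus)) for x, m0 in enumerate(mu0))

def V_trajectory(mu0, mus, W, Q, T):
    """Return [V(0), V(1), ..., V(T)]."""
    out = []
    for _ in range(T + 1):
        out.append(V(mu0, mus, Q))
        mu0 = step(mu0, W)
        mus = [step(m, W) for m in mus]
    return out

W_pos = [[0.70, 0.20, 0.10],
         [0.10, 0.60, 0.30],
         [0.50, 0.25, 0.25]]

# A jointly convex function of two variables (sum of two convex functions).
Q2 = lambda a, b: max(a, b) + a * a

vals = V_trajectory(
    [2.0, 1.0, 0.5],                        # mu^0: strictly positive, unnormalized
    [[0.3, 1.5, 0.2], [1.0, 0.0, 4.0]],     # mu^1 and mu^2: arbitrary nonnegative
    W_pos, Q2, 15)
print(vals[0], vals[1], vals[-1])
```

Taking k = 1, with the two functions set to two time-varying state distributions of the same chain and Q equal to the negative logarithm, would reduce V(t) to the divergence between them, one of the special cases mentioned above.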

It is important to observe that exactly the same idea can be applied, first of all, to the 1973 version of the Ziv-Zakai data processing theorem (regardless of the above described monotonicity results concerning Markov processes): Consider the generalized mutual information functional

$$J^Q(X;Y) = \sum_{x,y}\mu_0(x,y)\, Q\!\left(\frac{\mu_1(x,y)}{\mu_0(x,y)}\right), \tag{41}$$

where $\mu_0(x,y) > 0$ and $\mu_1(x,y)$ are arbitrary functions that are consistent with the Markov conditions, i.e., for any Markov chain $X \to Y \to Z$, these functions satisfy

$$\mu_i(x,z) = \sum_{y}\mu_i(x,y)P(z|y) = \sum_{y}\mu_i(y,z)P(x|y), \quad i = 0, 1. \tag{42}$$

Then, $J^Q(X;Y)$ satisfies a data processing inequality, because, again, with $\tilde{Q}(v, u) = v\cdot Q(u/v)$ denoting the perspective of $Q$,

$$\begin{aligned}
J^Q(X;Y) &= \sum_{x,y} P(x,y)\cdot\frac{\mu_0(x,y)}{P(x,y)}\cdot Q\!\left(\frac{\mu_1(x,y)/P(x,y)}{\mu_0(x,y)/P(x,y)}\right) \\
&= \sum_{x,y} P(x,y)\,\tilde{Q}\!\left(\frac{\mu_0(x,y)}{P(x,y)}, \frac{\mu_1(x,y)}{P(x,y)}\right),
\end{aligned} \tag{43}$$

which is a Zakai-Ziv information functional of the 1975 version [15] and hence it satisfies a data processing inequality.

What functions, $\mu_0(x,y)$ and $\mu_1(x,y)$, can be consistent with the Markov conditions? Two such functions are, of course, $\mu_0(x,y) = P(x,y)$ and $\mu_1(x,y) = P(x)P(y)$, which bring us back to the 1973 Ziv-Zakai information measure. We can, of course, swap their roles and obtain a generalized version of the lautum information [11], which is also known to satisfy a data processing inequality. For additional options, let us consider a communication system, operating on single symbols (block length 1), where the source symbol $u$ is mapped into a channel input $x = f(u)$ by a deterministic encoder $f$, which is then fed into the channel $P(y|x)$, and the channel output $y$ is in turn mapped into the reconstruction symbol $v = g(y)$. As is argued in [15], the function $\mu(u,y) = P(u)P(y|u_0)$ is consistent with the Markov conditions for any given source symbol $u_0$. Indeed, since the encoder is assumed deterministic, $P(y|u_0) = P(y|f(u_0)) = P(y|x_0)$, and it is easily seen that

$$\mu(u,v) = P(u)P(v|u_0) = \sum_{y} P(u)P(y|u_0)P(v|y) = \sum_{y}\mu(u,y)P(v|y) \tag{44}$$

and

$$\begin{aligned}
\mu(u,y) &= P(u)P(y|u_0) \\
&= \sum_{x} P(u|x)P(x)P(y|u_0) \\
&= \sum_{x} P(u|x)P(x)P(y|x_0) = \sum_{x} P(u|x)\mu(x,y).
\end{aligned} \tag{45}$$

Of course, every linear combination of all these functions is also consistent with the Markov conditions. Thus, we can take

$$\mu_0(x,y) = s_0P(x,y) + \sum_{x_i\in\mathcal{X}} s_iP(x)P(y|x_i) \tag{46}$$

and

$$\mu_1(x,y) = t_0P(x,y) + \sum_{x_i\in\mathcal{X}} t_iP(x)P(y|x_i), \tag{47}$$

where $\{s_i\}$ and $\{t_i\}$ are the (arbitrary) coefficients of these linear combinations (with the limitation that $s_i \ge 0$ for all $i$, with at least one $s_i > 0$). Thus, we may define



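$$J^Q(X;Y) = \sum_{x,y}\left[s_0P(x,y) + \sum_{x_i\in\mathcal{X}} s_iP(x)P(y|x_i)\right] Q\!\left(\frac{t_0P(x,y) + \sum_{x_i\in\mathcal{X}} t_iP(x)P(y|x_i)}{s_0P(x,y) + \sum_{x_i\in\mathcal{X}} s_iP(x)P(y|x_i)}\right), \tag{48}$$

or, equivalently,

$$J^Q(X;Y) = \sum_{x,y} P(x)\left[s_0P(y|x) + \sum_{x_i\in\mathcal{X}} s_iP(y|x_i)\right] Q\!\left(\frac{t_0P(y|x) + \sum_{x_i\in\mathcal{X}} t_iP(y|x_i)}{s_0P(y|x) + \sum_{x_i\in\mathcal{X}} s_iP(y|x_i)}\right). \tag{49}$$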



Moreoever, to eliminate the dependence on the specific encoder, we can think of {xi} as independent 
random variables, take the expectation w.r.t. their randomness (in the same spirit as in [H]), and 
obtain the following information measure 

t P{y\x) + Y Jl UP(y\X i 



eIE^ 



x,y 



s P(y\x) + J2 s i P (y\ X i 



Q 



soP(y\x) + EiSiP(y\Xi 



(50) 



where the expectation is w.r.t. the product measure of {Xi}, P(xi,X2, ■ ■ •) = Y\iP(xi)- These are 
the most general information measures, that obey a data processing inequality, that we can get 
with a univariate convex function Q. For example, returning to eq. ()49|) and taking so = 1, to = 0, 
Si = sP{xi) (s > 0, a parameter), and U = P(xi), x\ G X, we have /io(x,y) = P(x,y) +sP(x)P(y), 
and ni(x,y) = P(x)P(y), and the resulting generalized mutual information reads 

$$I^Q(X;Y) = \sum_{x,y}P(x)[P(y|x)+sP(y)]\cdot Q\left(\frac{P(y)}{P(y|x)+sP(y)}\right). \tag{51}$$
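As a numerical illustration of (51), the following is a minimal sketch (not part of the original derivation) that evaluates the functional for $Q(z) = -\sqrt{z}$ and checks the data processing inequality on a randomly drawn chain $U \to X \to Y \to V$. The name `gen_mutual_info`, the alphabet sizes, the random stochastic encoder and decoder, and the tested values of `s` are all assumptions made for illustration; strictly positive transition matrices are used so that $\mu_0 > 0$ everywhere.

```python
import numpy as np

def gen_mutual_info(joint, Q, s=0.0):
    """Evaluate sum_{a,b} mu0 * Q(mu1 / mu0) as in eq. (51), with
    mu0 = P(a,b) + s P(a)P(b) and mu1 = P(a)P(b).
    Assumes mu0 > 0 everywhere (illustrative restriction)."""
    joint = np.asarray(joint, dtype=float)
    prod = np.outer(joint.sum(axis=1), joint.sum(axis=0))  # P(a)P(b)
    mu0 = joint + s * prod
    return float(np.sum(mu0 * Q(prod / mu0)))

rng = np.random.default_rng(0)

def random_channel(n_in, n_out):
    """Random strictly positive row-stochastic matrix (hypothetical test data)."""
    m = rng.random((n_in, n_out)) + 0.05
    return m / m.sum(axis=1, keepdims=True)

p_u = np.full(4, 0.25)                      # uniform source
enc = random_channel(4, 3)                  # P(x | u)
chan = random_channel(3, 3)                 # P(y | x)
dec = random_channel(3, 4)                  # P(v | y)

joint_ux = p_u[:, None] * enc               # P(u, x)
p_x = joint_ux.sum(axis=0)
joint_xy = p_x[:, None] * chan              # P(x, y)
joint_uv = joint_ux @ chan @ dec            # P(u, v)

Q = lambda z: -np.sqrt(z)                   # one of the convex functions named in the text
results = {
    s: (gen_mutual_info(joint_uv, Q, s), gen_mutual_info(joint_xy, Q, s))
    for s in (0.0, 0.5, 2.0)
}
for s, (i_uv, i_xy) in results.items():
    print(f"s={s}: I^Q(U;V)={i_uv:.6f} <= I^Q(X;Y)={i_xy:.6f}")
```

For every value of `s`, the first number should not exceed the second, which is the data processing inequality that the bounds in this section rely on.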



The interesting point concerning these generalized mutual information measures is that even if we
remain in the framework of the 1973 version of the Ziv-Zakai data processing theorem (as opposed
to the 1975 version), we have added an extra degree of freedom (in the above example, the parameter
$s$), which may be used in order to improve the obtained bounds. If the inequality $R^Q(d) \le C^Q$
can be transformed into an inequality on the distortion $d$, where the lower bound depends on $s$,
then this bound can be maximized w.r.t. the parameter $s$. If the optimum $s \ge 0$ yields a distortion
bound which is larger than that of $s = 0$, then we have improved on [13] for the given choice of
the convex function $Q$. Sometimes this optimization may not be a trivial task, but even if we can
just identify one positive value of $s$ (including the limit $s \to \infty$) that is better than $s = 0$, then
we have improved on the generalized data processing bound of [H], which corresponds to $s = 0$.
This additional degree of freedom may be important, because, as mentioned in the Introduction,
the variety of convex functions $\{Q\}$ which are convenient to work with is somewhat limited (most
notably, the functions $Q(z) = z^2$, $Q(z) = 1/z$, $Q(z) = -\sqrt{z}$ and some piecewise linear functions
j,|15j). The next example demonstrates this point. 



Example. Consider the information functional (51) with the convex function $Q(z) = -\sqrt{z}$. Then,
the corresponding generalized mutual information is

$$\begin{aligned}
I^Q(U;V) &= -\sum_{u,v}P(u)[P(v|u)+sP(v)]\cdot\sqrt{\frac{P(v)}{P(v|u)+sP(v)}} \\
&= -\sum_{u,v}P(u)\sqrt{P(v)[P(v|u)+sP(v)]} \\
&= -\sum_{u,v}P(u)P(v)\sqrt{s+\frac{P(v|u)}{P(v)}}.
\end{aligned} \tag{52}$$

Consider now the above-described problem of joint source-channel coding, for the following source
and channel: The source is designated by a random variable $U$, which is uniformly distributed over
the alphabet $\mathcal{U} = \{0, 1, \ldots, K-1\}$. The reproduction variable, $V$, takes on values in the same
alphabet, i.e., $\mathcal{V} = \mathcal{U} = \{0, 1, \ldots, K-1\}$, and the distortion function is

$$d(u,v) = \begin{cases} 0 & v = u \\ 1 & v = (u+1) \bmod K \\ \infty & \text{elsewhere} \end{cases} \tag{53}$$

which means that errors other than $v = (u+1) \bmod K$ are strictly forbidden. Therefore, the channel
from $U$ to $V$ must be of the form

$$P(v|u) = \begin{cases} 1-\epsilon_u & v = u \\ \epsilon_u & v = (u+1) \bmod K \\ 0 & \text{elsewhere} \end{cases} \tag{54}$$

where $\{\epsilon_u\}$ are parameters taking values in $[0,1]$ and complying with the distortion constraint

$$\mathbf{E}\{d(U,V)\} = \frac{1}{K}\sum_{u=0}^{K-1}\epsilon_u \le d. \tag{55}$$

The channel is a noise-free $L$-ary channel, i.e., its input and output alphabets are $\mathcal{X} = \mathcal{Y} =
\{0, 1, \ldots, L-1\}$ with $P(y|x) = 1$ for $y = x$, and $P(y|x) = 0$ otherwise.

Obviously, the case $K \le L$ is not interesting because the data can be conveyed error-free by
trivially connecting the source to the channel. In the other extreme, where $K > 2L$, there must be
some channel input symbol to which at least three source symbols are mapped. In such a case, it is
impossible to avoid at least one of the forbidden errors in the reconstruction. Thus, the interesting
cases are those for which $L < K \le 2L$, or equivalently, $\theta \in (1, 2]$, where $\theta = K/L$.

We next derive a distortion bound based on the generalized data processing theorem, in the
spirit of [14] and [15], where we now have the parameter $s$ as a degree of freedom.

As for the source, let us suppose that in addition to the distortion constraint, we impose the
constraint that the distribution of the reproduction variable $V$, just like $U$, must be uniform over
its alphabet, namely, $P(v) = 1/K$ for all $v \in \mathcal{V}$. In this case,



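$$\begin{aligned}
-I^Q(U;V) &= \sum_{u,v}P(u)P(v)\sqrt{s+\frac{P(v|u)}{P(v)}} \\
&= \frac{1}{K^2}\sum_{u=0}^{K-1}\left[\sqrt{s+K\epsilon_u}+\sqrt{s+K(1-\epsilon_u)}+(K-2)\sqrt{s}\right] \\
&= \frac{1}{K^2}\sum_{u=0}^{K-1}\left[\sqrt{s+K\epsilon_u}+\sqrt{s+K(1-\epsilon_u)}\right]+\left(1-\frac{2}{K}\right)\sqrt{s} \\
&\le \frac{1}{K^2}\cdot K\left[\sqrt{s+Kd}+\sqrt{s+K(1-d)}\right]+\left(1-\frac{2}{K}\right)\sqrt{s} \\
&= \frac{1}{K}\left[\sqrt{s+Kd}+\sqrt{s+K(1-d)}\right]+\left(1-\frac{2}{K}\right)\sqrt{s},
\end{aligned} \tag{56}$$

where the inequality follows from the fact that the maximum of the concave function

$$\sum_{u=0}^{K-1}\left[\sqrt{s+K\epsilon_u}+\sqrt{s+K(1-\epsilon_u)}\right],$$

subject to the distortion constraint (55), is achieved when $\epsilon_u = d$ for all $u \in \mathcal{U}$. Thus,

$$R^Q(d) = -\frac{1}{K}\left[\sqrt{s+Kd}+\sqrt{s+K(1-d)}\right]-\left(1-\frac{2}{K}\right)\sqrt{s}. \tag{57}$$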
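The closed form (57) can be checked against a direct summation of (52) over the channel (54). The sketch below is illustrative only: the function names and the values of `K`, `d` and `s` are assumptions, and the channel is built with all $\epsilon_u$ equal to $d$, which is the maximizing choice identified above.

```python
import numpy as np

def neg_info_source(eps, s):
    """-I^Q(U;V) for Q(z) = -sqrt(z), with U uniform on K symbols and the
    cyclic channel (54), computed by direct summation of (52)."""
    eps = np.asarray(eps, dtype=float)
    K = eps.size
    P = np.zeros((K, K))                  # P[u, v] = P(v | u)
    for u in range(K):
        P[u, u] = 1.0 - eps[u]
        P[u, (u + 1) % K] = eps[u]
    p_u = np.full(K, 1.0 / K)
    p_v = p_u @ P                         # output marginal
    return float(np.sum(np.outer(p_u, p_v) * np.sqrt(s + P / p_v)))

def rate_distortion_Q(d, s, K):
    """Closed-form expression (57)."""
    return (-(np.sqrt(s + K * d) + np.sqrt(s + K * (1 - d))) / K
            - (1 - 2 / K) * np.sqrt(s))

K, d, s = 5, 0.2, 0.7                     # hypothetical example values
lhs = -neg_info_source(np.full(K, d), s)  # I^Q(U;V) at eps_u = d
rhs = rate_distortion_Q(d, s, K)
print(lhs, rhs)                           # the two values should agree
```

Using a smaller common crossover value than `d` gives a smaller value of `neg_info_source`, consistent with the maximum in (56) being attained at $\epsilon_u = d$ when $d \le 1/2$.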
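As for the channel, we have:

$$\begin{aligned}
-I^Q(X;Y) &= \sum_{x,y}P(x)P(y)\sqrt{s+\frac{P(y|x)}{P(y)}} \\
&= \sum_x\sum_{x'\ne x}P(x)P(x')\sqrt{s}+\sum_x P^2(x)\sqrt{s+\frac{1}{P(x)}} \\
&= \sqrt{s}+\sum_x P^2(x)\left[\sqrt{s+\frac{1}{P(x)}}-\sqrt{s}\right] \\
&= \sqrt{s}+\sum_x P^2(x)\cdot\frac{1/P(x)}{\sqrt{s+1/P(x)}+\sqrt{s}} \\
&= \sqrt{s}+\sum_x\frac{P(x)}{\sqrt{s+1/P(x)}+\sqrt{s}}.
\end{aligned} \tag{58}$$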



The function $f(t) = t/[\sqrt{s+1/t}+\sqrt{s}]$ is convex in $t$ (for fixed $s$) since $f''(t) \ge 0$ for all $t > 0$, as
can readily be verified. Thus, $-I^Q(X;Y)$ is minimized by the uniform distribution $P(x) = 1/L$,
$\forall x$, which leads to the 'capacity' expression:



$$C^Q = -\sqrt{s}-\frac{1}{\sqrt{s}+\sqrt{s+L}}. \tag{59}$$
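A quick numerical check of the channel-side computation, offered as a sketch under stated assumptions: the function names, the alphabet size `L`, the value of `s`, and the randomly drawn input distribution are all hypothetical. It compares the direct double sum with the closed form in (58), and confirms on one random example that the uniform input attains a value no larger than the random one.

```python
import numpy as np

def neg_info_noiseless(p_x, s):
    """-I^Q(X;Y) for Q(z) = -sqrt(z) over a noise-free channel (Y = X),
    by direct summation. Assumes all entries of p_x are positive."""
    p_x = np.asarray(p_x, dtype=float)
    chan = np.eye(p_x.size)               # P(y | x) = 1 if y == x else 0
    p_y = p_x @ chan
    return float(np.sum(np.outer(p_x, p_y) * np.sqrt(s + chan / p_y)))

def neg_info_closed_form(p_x, s):
    """Last line of (58): sqrt(s) + sum_x f(P(x)),
    with f(t) = t / (sqrt(s + 1/t) + sqrt(s))."""
    p_x = np.asarray(p_x, dtype=float)
    return float(np.sqrt(s) + np.sum(p_x / (np.sqrt(s + 1.0 / p_x) + np.sqrt(s))))

def capacity_Q(s, L):
    """The 'capacity' expression obtained at the uniform input."""
    return -np.sqrt(s) - 1.0 / (np.sqrt(s) + np.sqrt(s + L))

rng = np.random.default_rng(1)
L, s = 4, 0.3                             # hypothetical example values
p_rand = rng.dirichlet(np.ones(L))        # a random input distribution
p_unif = np.full(L, 1.0 / L)

print(neg_info_noiseless(p_rand, s), neg_info_closed_form(p_rand, s))
print(-neg_info_noiseless(p_unif, s), capacity_Q(s, L))
```

Each printed pair should agree up to floating-point error.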



Applying now the data processing theorem,

$$R^Q(d) \le C^Q, \tag{60}$$

we obtain, after rearranging terms,

$$\sqrt{s+Kd}+\sqrt{s+K(1-d)} \ge \frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}. \tag{61}$$

Squaring both sides, we have:

$$2s+K+2\sqrt{(s+Kd)[s+K(1-d)]} \ge \left(\frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}\right)^2 \tag{62}$$

or

$$2\sqrt{(s+Kd)[s+K(1-d)]} \ge \left(\frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}\right)^2-2s-K, \tag{63}$$



which, after squaring again and applying some further straightforward algebraic manipulations,
eventually gives the following inequality on the distortion $d$:

$$4d(1-d) \ge \psi(s), \tag{64}$$
where

$$\psi(s) = \frac{1}{K^2}\left[\left(\frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}\right)^2-2s-K\right]^2-\frac{4s(s+K)}{K^2}. \tag{65}$$
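The algebra skipped between (63) and (64) can be sketched as follows. The shorthand $B(s)$ is introduced here only for readability and is not notation from the source; the squaring step assumes $B(s) \ge 0$.

```latex
% B(s) is a shorthand (our assumption) for the right-hand side of (63).
\begin{align*}
B(s) &\triangleq \left(\frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}\right)^2-2s-K,\\
4(s+Kd)\,[s+K(1-d)] &\ge B^2(s)
  && \text{(squaring (63), assuming } B(s)\ge 0\text{)},\\
4s^2+4sK+4K^2d(1-d) &\ge B^2(s)
  && \text{(expanding the product)},\\
4d(1-d) &\ge \frac{B^2(s)}{K^2}-\frac{4s(s+K)}{K^2}=\psi(s).
\end{align*}
```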

The resulting lower bound on the distortion is the smaller of the two solutions of the equation
$4d(1-d) = \psi(s)$, which is

$$d_s \triangleq \frac{1}{2}-\frac{1}{2}\sqrt{1-\psi(s)}. \tag{66}$$



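Thus, the larger $\psi(s)$ is, the better the bound. The choice $s = 0$, which corresponds to the usual
Ziv-Zakai bound for $Q(z) = -\sqrt{z}$, yields

$$\psi(0) = \frac{1}{K^2}\left[\left(\frac{K}{\sqrt{L}}\right)^2-K\right]^2 = \left(\frac{K}{L}-1\right)^2 = (\theta-1)^2. \tag{67}$$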



However, it turns out that $s = 0$ is not the best choice of $s$. We next examine the limit $s \to \infty$.
To this end, we derive a lower bound to $\psi(s)$ which is more convenient to analyze in this limit.
Note that for $s \ge L/8$, it is guaranteed that the expression in the square brackets of the expression
defining $\psi(s)$ is positive, which means that an upper bound on $\sqrt{s+L}$ would yield a lower bound
to $\psi(s)$. Thus, upper bounding $\sqrt{s+L}$ by

$$\sqrt{s+L} = \sqrt{s}\cdot\sqrt{1+L/s} \le \sqrt{s}\left(1+\frac{L}{2s}\right),$$

we get

$$\begin{aligned}
K^2\psi(s) &= \left[\left(\frac{K}{\sqrt{s}+\sqrt{s+L}}+2\sqrt{s}\right)^2-2s-K\right]^2-4s^2-4Ks \\
&\ge \left[\left(\frac{K}{\sqrt{s}\,(2+L/2s)}+2\sqrt{s}\right)^2-2s-K\right]^2-4s^2-4Ks \\
&= K^2\left(\frac{4s-L}{4s+L}\right)^2+\frac{16K^4s^2}{(4s+L)^4}-\frac{8KLs}{4s+L}+\frac{16K^2s^2}{(4s+L)^2}+\frac{8K^3s(4s-L)}{(4s+L)^3},
\end{aligned} \tag{68}$$



where between the second and the third lines, we have skipped some standard algebraic operations. 
Taking now the limit $s \to \infty$, we obtain

$$\psi_\infty = \lim_{s\to\infty}\psi(s) = \frac{1}{K^2}\left(K^2+0-2KL+K^2+0\right) = 2\left(1-\frac{1}{\theta}\right), \tag{69}$$

which yields a better bound than the bound of $s = 0$ since

$$2\left(1-\frac{1}{\theta}\right) > (\theta-1)^2 \tag{70}$$

for all $\theta \in (1,2)$.
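To see the effect of the parameter numerically, the sketch below evaluates $\psi(s)$ from (65) and the bound $d_s$ from (66) for one illustrative pair of alphabet sizes. The function names and the values `K = 6`, `L = 4`, `s = 1e4` are assumptions chosen only for the demonstration.

```python
import numpy as np

def psi(s, K, L):
    """psi(s) as defined in (65)."""
    inner = (K / (np.sqrt(s) + np.sqrt(s + L)) + 2.0 * np.sqrt(s)) ** 2 - 2.0 * s - K
    return inner ** 2 / K ** 2 - 4.0 * s * (s + K) / K ** 2

def distortion_bound(psi_value):
    """Smaller root of 4 d (1 - d) = psi, i.e. d_s of (66)."""
    return 0.5 - 0.5 * np.sqrt(1.0 - psi_value)

K, L = 6, 4                               # hypothetical sizes, theta = 1.5
theta = K / L
psi_0 = psi(0.0, K, L)                    # should equal (theta - 1)^2
psi_large = psi(1e4, K, L)                # proxy for the limit s -> infinity
psi_inf = 2.0 * (1.0 - 1.0 / theta)       # limiting value from (69)

print("psi(0)   =", psi_0, " d_0   =", distortion_bound(psi_0))
print("psi(1e4) =", psi_large, " d_1e4 =", distortion_bound(psi_large))
print("psi_inf  =", psi_inf, " d_inf =", distortion_bound(psi_inf))
```

For very large `s` the two terms of `psi` nearly cancel, so the direct formula loses some floating-point precision; the closed-form limit is the safer quantity to use in that regime.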

It is interesting to compare this also to the classical data processing theorem: Since

$$R(d) = \log K - h_2(d) \tag{71}$$

and

$$C = \log L, \tag{72}$$

then the ordinary data processing theorem yields the bound

$$h_2(d) \ge \log\theta. \tag{73}$$

Since

$$h_2(d) \ge 4d(1-d) \tag{74}$$

and

$$2\left(1-\frac{1}{\theta}\right) \ge \log\theta \tag{75}$$

within the relevant range of $\theta$, the bound pertaining to $s \to \infty$ is also better than the classical
bound for this case. This completes the description of the example. □
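The comparison with the classical bound can also be tabulated numerically. The following sketch assumes logarithms to base 2 (so that $h_2$ is measured in bits), solves $h_2(d) = \log_2\theta$ by bisection on $[0, 1/2]$, and compares the result with the limiting bound $d_\infty = \frac{1}{2} - \frac{1}{2}\sqrt{1-\psi_\infty}$. The function names and the grid of $\theta$ values are assumptions made for illustration.

```python
import math

def binary_entropy(d):
    """h_2(d) in bits."""
    if d in (0.0, 1.0):
        return 0.0
    return -d * math.log2(d) - (1.0 - d) * math.log2(1.0 - d)

def classical_bound(theta, tol=1e-12):
    """Smallest d in [0, 1/2] with h_2(d) >= log2(theta), via bisection."""
    lo, hi = 0.0, 0.5
    target = math.log2(theta)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

def limiting_bound(theta):
    """d_inf = 1/2 - (1/2) sqrt(1 - psi_inf), with psi_inf = 2 (1 - 1/theta)."""
    psi_inf = 2.0 * (1.0 - 1.0 / theta)
    return 0.5 - 0.5 * math.sqrt(max(0.0, 1.0 - psi_inf))

thetas = [1.1, 1.25, 1.5, 1.75, 2.0]
table = [(t, classical_bound(t), limiting_bound(t)) for t in thetas]
for t, d_classical, d_limit in table:
    print(f"theta={t}: classical d >= {d_classical:.4f}, s->inf d >= {d_limit:.4f}")
```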

Finally, we should comment that the monotonicity result concerning $V(t)$ contains as special
cases not only the H-theorem and all other earlier mentioned monotonicity results, but
also the 1975 Zakai-Ziv generalized data processing theorem [15]. Consider a Markov chain $U \to V \to W$,
where $U$, $V$ and $W$ are random variables that take on values in (finite) alphabets, $\mathcal{U}$, $\mathcal{V}$, and $\mathcal{W}$,
respectively. Let us now map between the Markov chain $(U,V,W)$ and the Markov process $\{X_t\}$
in the following manner: $(u,v) \in \mathcal{U}\times\mathcal{V}$ is assigned to the state $x'$ of the process at time $t$, whereas
$(u,w) \in \mathcal{U}\times\mathcal{W}$ corresponds⁶ to $x$ at time $t+1$. Now, defining accordingly,

$$\mu_t^0(x') = P(u,v), \tag{76}$$

$$\mu_t^1(x') = P(u)P(v), \tag{77}$$

$$\mu_{t+1}^0(x) = P(u,w), \tag{78}$$



⁶ While $\mathcal{V}$ and $\mathcal{W}$ may be different (finite) alphabets, $x$ and $x'$, of the original Markov process, must take on values
in the same alphabet. Assuming, without loss of generality, that $\mathcal{V} = \{1, 2, \ldots, |\mathcal{V}|\}$ and $\mathcal{W} = \{1, 2, \ldots, |\mathcal{W}|\}$, then
for the purpose of this mapping, we can unify these alphabets to be both $\{1, 2, \ldots, \max\{|\mathcal{V}|, |\mathcal{W}|\}\}$ and complete the
missing elements of the extended transition matrix $P(w|v)$ in a consistent manner, according to the actual support
of each distribution. We omit further technical details herein.
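and

$$\mu_{t+1}^1(x) = P(u)P(w), \tag{79}$$

then, due to the Markov property of $(U,V,W)$, both measures satisfy the recursion with $P(w|v)$
playing the role⁷ of $P(x|x')$, i.e.,

$$\begin{aligned}
P(u,w) &= \mu_{t+1}^0(x) \\
&= \sum_{x'}\mu_t^0(x')P(x|x') \\
&= \sum_v P(u,v)P(w|v)
\end{aligned} \tag{80}$$

and

$$\begin{aligned}
P(u)P(w) &= \mu_{t+1}^1(x) \\
&= \sum_{x'}\mu_t^1(x')P(x|x') \\
&= \sum_v P(u)P(v)P(w|v).
\end{aligned} \tag{81}$$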
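The mapping from the chain $U \to V \to W$ to one step of the Markov process can be illustrated numerically. The sketch below is an illustration under stated assumptions: the alphabet sizes, the random transition matrices, and the helper names are hypothetical, and $W$ is given the same alphabet as $V$ in the spirit of footnote 6. It propagates both measures through $P(w|v)$ as in (80) and (81) and evaluates the functional $\sum_x \mu^0(x)\,Q(\mu^1(x)/\mu^0(x))$ before and after the step for $Q(z) = -\ln z$.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_stochastic(n_in, n_out):
    """Random strictly positive row-stochastic matrix (hypothetical test data)."""
    m = rng.random((n_in, n_out)) + 0.05
    return m / m.sum(axis=1, keepdims=True)

def functional(mu0, mu1, Q):
    """sum_x mu0(x) Q(mu1(x) / mu0(x)); assumes mu0 > 0 everywhere."""
    return float(np.sum(mu0 * Q(mu1 / mu0)))

nU, nV = 3, 4                                   # W shares V's alphabet
p_u = rng.dirichlet(np.ones(nU))
P_v_given_u = random_stochastic(nU, nV)
P_w_given_v = random_stochastic(nV, nV)

mu0_t = p_u[:, None] * P_v_given_u              # (76): P(u, v)
mu1_t = np.outer(p_u, mu0_t.sum(axis=0))        # (77): P(u) P(v)
mu0_next = mu0_t @ P_w_given_v                  # (80): P(u, w)
mu1_next = mu1_t @ P_w_given_v                  # (81): P(u) P(w)

neg_log = lambda z: -np.log(z)
I_uv = functional(mu0_t, mu1_t, neg_log)        # I(U;V) in nats
I_uw = functional(mu0_next, mu1_next, neg_log)  # I(U;W) in nats
print(I_uv, I_uw)                               # expect I_uw <= I_uv
```

The component $u$ acts only as an index here: the matrix product applies $P(w|v)$ row by row, which is exactly the role assigned to $P(x|x')$ in the recursion.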

i> 

Thus, for $Q(z) = -\ln z$, the monotonicity of $V(t)$ is nothing but the data processing theorem of the classical
mutual information. For a general function $Q$ of one variable ($k = 1$), this gives the generalized
data processing theorem of [13]. Furthermore, letting $Q$ be a general convex function of $k$ variables,
and $\mu_t^0(x') = P(u,v)$ as before, we get the more general form of the data processing inequality of
[15].

The above extension of the H-theorem gives rise to a seemingly more general data processing
theorem than in [15], as it is not necessary to let $\mu_t^0(x)$ be the actual joint probability distribution.
However, when looking at the entire class of convex functions with an arbitrary number of
arguments, this is not really more general, as the corresponding generalized mutual information
can readily be transformed back to the form of the 1975 Zakai-Ziv information functional using
again the perspective operation. Indeed, as mentioned in the Introduction and shown in [15, Theorem
7.1], the class of generalized mutual information measures studied therein cannot be improved
upon in the sense that there always exist choices of $Q$ and $\{\mu_i\}$ that provide tight bounds on the
distortion of the optimum system.



⁷ Consider the component $u$ of $x' = (u, v)$ and $x = (u, w)$ simply as an index.






4 Summary and Conclusion 

The main contributions of this work can be summarized as follows: First, we have established a
unified framework and a relationship between (a generalized version of) the second law of thermo- 
dynamics and the generalized data processing theorems of Zakai and Ziv. This unified framework 
turns out to strengthen and expand both of these pieces of theory: Concerning the second law of 
thermodynamics, we have identified a significantly more general information measure, which is a 
monotonic function of time, when it operates on a Markov process. As for the generalized Ziv-Zakai 
data processing theorem, we have proposed a wider class of information measures obeying the data 
processing theorem, which includes free parameters that may be optimized so as to tighten the 
distortion bounds. 